{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 大众点评评价情感分析~\n",
    "先上结果：\n",
    "\n",
    "| 糖水店的评论文本                             | 模型预测的情感评分 |\n",
    "| :------------------------------------------- | :----------------- |\n",
    "| '糖水味道不错，滑而不腻，赞一个，下次还会来' | 0.91               |\n",
    "| '味道一般，没啥特点'                         | 0.52               |\n",
    "| '排队老半天，环境很差，味道一般般'           | 0.05               |\n",
    "\n",
    "模型的效果还可以的样子，yeah~接下来我们好好讲讲怎么做的哈，我们通过爬虫爬取了大众点评广州8家最热门糖水店的3W条评论信息以及评分作为训练数据，前面的分析我们得知*样本很不均衡*。接下来我们的整体思路就是：文本特征处理(分词、去停用词、TF-IDF)—机器学习建模—模型评价。\n",
    "\n",
    "我们先不处理样本不均衡问题，直接建模后查看结果，接下来我们再按照两种方法处理样本不均衡，对比结果。\n",
    "\n",
    "### 数据读入和探索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cus_id</th>\n",
       "      <th>comment_time</th>\n",
       "      <th>comment_star</th>\n",
       "      <th>cus_comment</th>\n",
       "      <th>kouwei</th>\n",
       "      <th>huanjing</th>\n",
       "      <th>fuwu</th>\n",
       "      <th>shopID</th>\n",
       "      <th>stars</th>\n",
       "      <th>year</th>\n",
       "      <th>month</th>\n",
       "      <th>weekday</th>\n",
       "      <th>hour</th>\n",
       "      <th>comment_len</th>\n",
       "      <th>cus_comment_processed</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>迷糊泰迪</td>\n",
       "      <td>2018-09-20 06:48:00</td>\n",
       "      <td>sml-str40</td>\n",
       "      <td>南信 算是 广州 著名 甜品店 吧 好几个 时间段 路过 都 是 座无虚席 看着 餐单 上 ...</td>\n",
       "      <td>非常好</td>\n",
       "      <td>好</td>\n",
       "      <td>好</td>\n",
       "      <td>518986</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2018</td>\n",
       "      <td>9</td>\n",
       "      <td>3</td>\n",
       "      <td>6</td>\n",
       "      <td>184</td>\n",
       "      <td>南信 算是 广州 著名 甜品店 吧 好几个 时间段 路过 都 是 座无虚席 看着 餐单 上 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>稱霸幼稚園</td>\n",
       "      <td>2018-09-22 21:49:00</td>\n",
       "      <td>sml-str40</td>\n",
       "      <td>中午 吃 完 了 所谓 的 早茶 回去 放下 行李 休息 了 会 就 来 吃 下午茶 了 服...</td>\n",
       "      <td>很好</td>\n",
       "      <td>很好</td>\n",
       "      <td>很好</td>\n",
       "      <td>518986</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2018</td>\n",
       "      <td>9</td>\n",
       "      <td>5</td>\n",
       "      <td>21</td>\n",
       "      <td>266</td>\n",
       "      <td>中午 吃 完 了 所谓 的 早茶 回去 放下 行李 休息 了 会 就 来 吃 下午茶 了 服...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>爱吃的美美侠</td>\n",
       "      <td>2018-09-22 22:16:00</td>\n",
       "      <td>sml-str40</td>\n",
       "      <td>冲刺 王者 战队 吃遍 蓉城 战队 有 特权 五月份 和 好 朋友 毕业 旅行 来 了 广州...</td>\n",
       "      <td>很好</td>\n",
       "      <td>很好</td>\n",
       "      <td>很好</td>\n",
       "      <td>518986</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2018</td>\n",
       "      <td>9</td>\n",
       "      <td>5</td>\n",
       "      <td>22</td>\n",
       "      <td>341</td>\n",
       "      <td>冲刺 王者 战队 吃遍 蓉城 战队 有 特权 五月份 和 好 朋友 毕业 旅行 来 了 广州...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>姜姜会吃胖</td>\n",
       "      <td>2018-09-19 06:36:00</td>\n",
       "      <td>sml-str40</td>\n",
       "      <td>都 说来 广州 吃 糖水 就要 来南信 招牌 姜撞奶 红豆 双皮奶 牛 三星 云吞面 一楼 ...</td>\n",
       "      <td>非常好</td>\n",
       "      <td>很好</td>\n",
       "      <td>很好</td>\n",
       "      <td>518986</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2018</td>\n",
       "      <td>9</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>197</td>\n",
       "      <td>都 说来 广州 吃 糖水 就要 来南信 招牌 姜撞奶 红豆 双皮奶 牛 三星 云吞面 一楼 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>forevercage</td>\n",
       "      <td>2018-08-24 17:58:00</td>\n",
       "      <td>sml-str50</td>\n",
       "      <td>一直 很 期待 也 最 爱 吃 甜品 广州 的 甜品 很 丰富 很 多样 来 之前 就 一直...</td>\n",
       "      <td>非常好</td>\n",
       "      <td>很好</td>\n",
       "      <td>很好</td>\n",
       "      <td>518986</td>\n",
       "      <td>5.0</td>\n",
       "      <td>2018</td>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>17</td>\n",
       "      <td>261</td>\n",
       "      <td>一直 很 期待 也 最 爱 吃 甜品 广州 的 甜品 很 丰富 很 多样 来 之前 就 一直...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        cus_id         comment_time comment_star  \\\n",
       "0         迷糊泰迪  2018-09-20 06:48:00    sml-str40   \n",
       "1        稱霸幼稚園  2018-09-22 21:49:00    sml-str40   \n",
       "2       爱吃的美美侠  2018-09-22 22:16:00    sml-str40   \n",
       "3        姜姜会吃胖  2018-09-19 06:36:00    sml-str40   \n",
       "4  forevercage  2018-08-24 17:58:00    sml-str50   \n",
       "\n",
       "                                         cus_comment kouwei huanjing fuwu  \\\n",
       "0  南信 算是 广州 著名 甜品店 吧 好几个 时间段 路过 都 是 座无虚席 看着 餐单 上 ...    非常好        好    好   \n",
       "1  中午 吃 完 了 所谓 的 早茶 回去 放下 行李 休息 了 会 就 来 吃 下午茶 了 服...     很好       很好   很好   \n",
       "2  冲刺 王者 战队 吃遍 蓉城 战队 有 特权 五月份 和 好 朋友 毕业 旅行 来 了 广州...     很好       很好   很好   \n",
       "3  都 说来 广州 吃 糖水 就要 来南信 招牌 姜撞奶 红豆 双皮奶 牛 三星 云吞面 一楼 ...    非常好       很好   很好   \n",
       "4  一直 很 期待 也 最 爱 吃 甜品 广州 的 甜品 很 丰富 很 多样 来 之前 就 一直...    非常好       很好   很好   \n",
       "\n",
       "   shopID  stars  year  month  weekday  hour  comment_len  \\\n",
       "0  518986    4.0  2018      9        3     6          184   \n",
       "1  518986    4.0  2018      9        5    21          266   \n",
       "2  518986    4.0  2018      9        5    22          341   \n",
       "3  518986    4.0  2018      9        2     6          197   \n",
       "4  518986    5.0  2018      8        4    17          261   \n",
       "\n",
       "                               cus_comment_processed  \n",
       "0  南信 算是 广州 著名 甜品店 吧 好几个 时间段 路过 都 是 座无虚席 看着 餐单 上 ...  \n",
       "1  中午 吃 完 了 所谓 的 早茶 回去 放下 行李 休息 了 会 就 来 吃 下午茶 了 服...  \n",
       "2  冲刺 王者 战队 吃遍 蓉城 战队 有 特权 五月份 和 好 朋友 毕业 旅行 来 了 广州...  \n",
       "3  都 说来 广州 吃 糖水 就要 来南信 招牌 姜撞奶 红豆 双皮奶 牛 三星 云吞面 一楼 ...  \n",
       "4  一直 很 期待 也 最 爱 吃 甜品 广州 的 甜品 很 丰富 很 多样 来 之前 就 一直...  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from matplotlib import pyplot as plt\n",
    "import jieba\n",
    "data = pd.read_csv('data.csv')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 构建标签值\n",
    "\n",
    "大众点评的评分分为1-5分，1-2为差评，4-5为好评，3为中评，因此我们把1-2记为0,4-5记为1,3为中评，对我们的情感分析作用不大，丢弃掉这部分数据，但是可以作为训练语料模型的语料。我们的情感评分可以转化为标签值为1的概率值，这样我们就把情感分析问题转为文本分类问题了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "#构建label值\n",
    "def zhuanhuan(score):\n",
    "    if score > 3:\n",
    "        return 1\n",
    "    elif score < 3:\n",
    "        return 0\n",
    "    else:\n",
    "        return None\n",
    "    \n",
    "#特征值转换\n",
    "data['target'] = data['stars'].map(lambda x:zhuanhuan(x))\n",
    "data_model = data.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 文本特征处理\n",
    "\n",
    "中文文本特征处理，需要进行中文分词，jieba分词库简单好用。接下来需要过滤停用词，网上能够搜到现成的。最后就要进行文本转向量，有词库表示法、TF-IDF、word2vec等，这篇文章作了详细介绍，推荐一波 https://zhuanlan.zhihu.com/p/44917421\n",
    "\n",
    "这里我们使用sklearn库的TF-IDF工具进行文本特征提取。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "26328    冲着 老字号 去 的 但 真的 不 咋 地 所有 的 东西 都 不是 先 做 的 点 了 煎...\n",
       "3457                 名过其实 味道 一般 全是 人 价格 也 贵 感觉 碗碟 都 洗 不 干净\n",
       "17834    香芋 西米 以前 系 北京路 个边 上班 落 左班 都 会同 同事 去 吃糖 上个星期 去 ...\n",
       "7904     南信 的 双皮奶 非常 滑 不是 很甜 可是 很嫩 非常 美味 还有 鲜虾 肠有 好多好多 ...\n",
       "3258     环境 真的 是 一般 可能 是 老店 的 关系 但 很 传统 双皮奶 奶味重 甜度 适中 粥...\n",
       "Name: cus_comment, dtype: object"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#切分测试集、训练集\n",
    "from sklearn.model_selection import train_test_split\n",
    "x_train, x_test, y_train, y_test = train_test_split(data_model['cus_comment'], data_model['target'], random_state=3, test_size=0.25)\n",
    "\n",
    "#引入停用词\n",
    "infile = open(\"stopwords.txt\",encoding='utf-8')\n",
    "stopwords_lst = infile.readlines()\n",
    "stopwords = [x.strip() for x in stopwords_lst]\n",
    "\n",
    "#中文分词\n",
    "def fenci(train_data):\n",
    "    words_df = train_data.apply(lambda x:' '.join(jieba.cut(x)))\n",
    "    return words_df\n",
    " \n",
    "x_train[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "        lowercase=True, max_df=1.0, max_features=3000, min_df=1,\n",
       "        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,\n",
       "        stop_words=['!', '\"', '#', '$', '%', '&', \"'\", '(', ')', '*', '+', ',', '-', '--', '.', '..', '...', '......', '...................', './', '.一', '记者', '数', '年', '月', '日', '时', '分', '秒', '/', '//', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '://', '::', ';', '<', '=', '>', '>>', '?', '@'...3', '94', '95', '96', '97', '98', '99', '100', '01', '02', '03', '04', '05', '06', '07', '08', '09'],\n",
       "        strip_accents=None, sublinear_tf=False,\n",
       "        token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b', tokenizer=None, use_idf=True,\n",
       "        vocabulary=None)"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#使用TF-IDF进行文本转向量处理\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "tv = TfidfVectorizer(stop_words=stopwords, max_features=3000, ngram_range=(1,2))\n",
    "tv.fit(x_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 机器学习建模\n",
    "\n",
    "特征和标签已经准备好了，接下来就是建模了。这里我们使用文本分类的经典算法朴素贝叶斯算法，而且朴素贝叶斯算法的计算量较少。特征值是评论文本经过TF-IDF处理的向量，标签值评论的分类共两类，好评是1，差评是0。情感评分为分类器预测分类1的概率值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9275308869629356"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#计算分类效果的准确率\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.metrics import roc_auc_score, f1_score\n",
    "classifier = MultinomialNB()\n",
    "classifier.fit(tv.transform(x_train), y_train)\n",
    "classifier.score(tv.transform(x_test), y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8971584233148908"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#计算分类器的AUC值\n",
    "y_pred = classifier.predict_proba(tv.transform(x_test))[:,1]\n",
    "roc_auc_score(y_test,y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "#计算一条评论文本的情感评分\n",
    "def ceshi(model,strings):\n",
    "    strings_fenci = fenci(pd.Series([strings]))\n",
    "    return float(model.predict_proba(tv.transform(strings_fenci))[:,1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "好评实例的模型预测情感得分为0.8638082706675478\n",
      "差评实例的模型预测情感得分为0.7856544482460911\n"
     ]
    }
   ],
   "source": [
    "#从大众点评网找两条评论来测试一下\n",
    "test1 = '很好吃，环境好，所有员工的态度都很好，上菜快，服务也很好，味道好吃，都是用蒸馏水煮的，推荐，超好吃' #5星好评\n",
    "test2 = '糯米外皮不绵滑，豆沙馅粗躁，没有香甜味。12元一碗不值。' #1星差评\n",
    "print('好评实例的模型预测情感得分为{}\\n差评实例的模型预测情感得分为{}'.format(ceshi(classifier,test1),ceshi(classifier,test2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看出，准确率和AUC值都非常不错的样子，但点评网上的实际测试中，5星好评模型预测出来了，1星差评缺预测错误。为什么呢？我们查看一下**混淆矩阵**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  46,  385],\n",
       "       [   8, 4984]], dtype=int64)"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "y_predict = classifier.predict(tv.transform(x_test))\n",
    "cm = confusion_matrix(y_test, y_predict)\n",
    "cm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看出，**负类的预测非常不准**，433单准确预测为负类的只有15.7%，应该是由于**数据不平衡**导致的，模型的默认阈值为输出值的中位数。比如逻辑回归的输出范围为[0,1]，当某个样本的输出大于0.5就会被划分为正例，反之为反例。在数据的类别不平衡时，采用默认的分类阈值可能会导致输出全部为正例，产生虚假的高准确度，导致分类失败。\n",
    "\n",
    "处理样本不均衡问题的方法，首先可以选择调整阈值，使得模型对于较少的类别更为敏感，或者选择合适的评估标准，比如ROC或者F1，而不是准确度（accuracy）。另外一种方法就是通过采样（sampling）来调整数据的不平衡。其中欠采样抛弃了大部分正例数据，从而弱化了其影响，可能会造成偏差很大的模型，同时，数据总是宝贵的，抛弃数据是很奢侈的。另外一种是过采样，下面我们就使用过采样方法来调整。\n",
    "\n",
    "### 过采样（单纯复制）\n",
    "\n",
    "单纯的重复了反例，因此会过分强调已有的反例。如果其中部分点标记错误或者是噪音，那么错误也容易被成倍的放大。因此最大的风险就是对反例过拟合。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0    19916\n",
       "0.0     1779\n",
       "Name: target, dtype: int64"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['target'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "#把0类样本复制10次，构造训练集\n",
    "index_tmp = y_train==0\n",
    "y_tmp = y_train[index_tmp]\n",
    "x_tmp = x_train[index_tmp]\n",
    "x_train2 = pd.concat([x_train,x_tmp,x_tmp,x_tmp,x_tmp,x_tmp,x_tmp,x_tmp,x_tmp,x_tmp,x_tmp])\n",
    "y_train2 = pd.concat([y_train,y_tmp,y_tmp,y_tmp,y_tmp,y_tmp,y_tmp,y_tmp,y_tmp,y_tmp,y_tmp])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9049699937533463"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#使用过采样样本(简单复制)进行模型训练，并查看准确率\n",
    "clf2 = MultinomialNB()\n",
    "clf2.fit(tv.transform(x_train2), y_train2)\n",
    "y_pred2 = clf2.predict_proba(tv.transform(x_test))[:,1]\n",
    "roc_auc_score(y_test,y_pred2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 331,  100],\n",
       "       [ 637, 4355]], dtype=int64)"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#查看此时的混淆矩阵\n",
    "y_predict2 = clf2.predict(tv.transform(x_test))\n",
    "cm = confusion_matrix(y_test, y_predict2)\n",
    "cm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看出，即使是简单粗暴的复制样本来处理样本不平衡问题，负样本的识别率大幅上升了，变为77%，满满的幸福感呀~我们自己写两句评语来看看"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3520907656917545"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ceshi(clf2,'排队人太多，环境不好，口味一般')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看出把0类别的识别出来了，太棒了~"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 过采样（SMOTE算法）\n",
    "\n",
    "SMOTE（Synthetic minoritye over-sampling technique,SMOTE），是在局部区域通过K-近邻生成了新的反例。相较于简单的过采样，SMOTE降低了过拟合风险，但同时运算开销加大\n",
    "\n",
    "对SMOTE感兴趣的同学可以看下这篇文章https://www.jianshu.com/p/ecbc924860af"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "#使用SMOTE进行样本过采样处理\n",
    "from imblearn.over_sampling import SMOTE\n",
    "oversampler=SMOTE(random_state=0)\n",
    "x_train_vec = tv.transform(x_train)\n",
    "x_resampled, y_resampled = oversampler.fit_sample(x_train_vec, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0    14920\n",
       "0.0     1348\n",
       "Name: target, dtype: int64"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#原始的样本分布\n",
    "y_train.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0    14920\n",
       "0.0    14920\n",
       "dtype: int64"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#经过SMOTE算法过采样后的样本分布情况\n",
    "pd.Series(y_resampled).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们经过插值，把0类数据也丰富为14923个数据了，这时候正负样本的比例为1:1，接下来我们用平衡后的数据进行训练，效果如何呢，好期待啊~"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9072990102028675"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#使用过采样样本(SMOTE)进行模型训练，并查看准确率\n",
    "clf3 = MultinomialNB()\n",
    "clf3.fit(x_resampled, y_resampled)\n",
    "y_pred3 = clf3.predict_proba(tv.transform(x_test))[:,1]\n",
    "roc_auc_score(y_test,y_pred3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 330,  101],\n",
       "       [ 601, 4391]], dtype=int64)"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#查看此时的准确率\n",
    "y_predict3 = clf3.predict(tv.transform(x_test))\n",
    "cm = confusion_matrix(y_test, y_predict3)\n",
    "cm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.23491893934922978"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#到网上找一条差评来测试一下情感评分的预测效果\n",
    "test3 = '糯米外皮不绵滑，豆沙馅粗躁，没有香甜味。12元一碗不值。'\n",
    "ceshi(clf3,test3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看出，使用SMOTE插值与简单的数据复制比起来，AUC率略有提高，实际预测效果也挺好\n",
    "\n",
    "### 模型评估测试\n",
    "\n",
    "接下来我们把3W条数据都拿来训练，数据量变多了，模型效果应该会更好"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#词向量训练\n",
    "tv2 = TfidfVectorizer(stop_words=stopwords, max_features=3000, ngram_range=(1,2))\n",
    "tv2.fit(data_model['cus_comment'])\n",
    "\n",
    "#SMOTE插值\n",
    "X_tmp = tv2.transform(data_model['cus_comment'])\n",
    "y_tmp = data_model['target']\n",
    "sm = SMOTE(random_state=0)\n",
    "X,y = sm.fit_sample(X_tmp, y_tmp)\n",
    "\n",
    "clf = MultinomialNB()\n",
    "clf.fit(X, y)\n",
    "\n",
    "def fenxi(strings):\n",
    "    strings_fenci = fenci(pd.Series([strings]))\n",
    "    return float(clf.predict_proba(tv2.transform(strings_fenci))[:,1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.28900092243477077"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#到网上找一条差评来测试一下\n",
    "fenxi('糯米外皮不绵滑，豆沙馅粗躁，没有香甜味。12元一碗不值。')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "只用到了简单的机器学习，就做出了不错的情感分析效果，知识的力量真是强大呀，666~\n",
    "### 后续优化方向\n",
    "\n",
    "- 使用更复杂的机器学习模型如神经网络、支持向量机等\n",
    "- 模型的调参\n",
    "- 行业词库的构建\n",
    "- 增加数据量\n",
    "- 优化情感分析的算法\n",
    "- 增加标签提取等\n",
    "- 项目部署到服务器上，更好地分享和测试模型的效果"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
